pacman::p_load(sf, tidyverse, questionr, janitor, psych, ggplot2, gcookbook, tmap, ggpubr, egg, corrplot, gtsummary, regclass, caret, heatmaply, ggdendro, cluster, factoextra, spdep, ClustGeo, GGally, skimr, stringr)Regionalisation with Spatially Constrained Cluster Analysis
case study : Regionalisation by Multivariate Water Point Attributes with Non-spatially Constrained and Spatially Constrained Clustering Methods.
1. OVERVIEW
Regionalisation with Spatially Constrained clustering analysis requires similar observations to be grouped according to their statistical attributes and spatial location.
1.1 Objectives
1.2 Study Area Introduction
1.3 Scope of Works
2. R PACKAGE REQUIRED
The following are the packages required for this exercise :
2.1 Load R Packages
Usage of the code chunk below :
p_load( ) - pacman - to load packages. This function will attempt to install the package from CRAN or pacman repository list if its found not installed.
3. GEOSPATIAL DATA
3.1 Acquire Data Source
Aspatial Data
- Download the Nigeria data set in shapefile format via Access WPdx+ Global Data Repository from WPdx Global Data Repositories.
- Rename the title of the data set to “geo_export”.
The file size of the downloaded data is about 422 MB due to water points data from multiple countries.
- Such file size may require extra effort and time to manage the code chunks and files in the R environment before pushing them to GitHub.
Hence, to avoid any error in pushing files larger than 100 MB to Git, filtered Nigeria water points and removed unnecessary variables before uploading into the R environment.
Therewith, the CSV file size should be lesser than 100 MB.
Geospatial Data
- Download the Nigeria geoBoundaries data set at ADM2 level4 from geoBoundaries.org or the Humanitarian Data Exchange portal.
- Rename the title of the data set to “nga_admbnda_adm2_osgof_20190417”
4 Runfola, D. et al. (2020) geoBoundaries: A global database of political administrative boundaries. PLoS ONE 15(4): e0231866. https://doi.org/10.1371/journal.pone.0231866
3.2 Import Attribute Data
Four (4) data frames to be created for different context, i.e.
wp_coord = coordinated related variables.
wp_cond = status and conditions related variables.
wp_adm = administrative and measurements related variables.
wp = master file that combine all three (3) data frames above.
3.2.4 Create Master File
Usage of the code chunk below :
left_join( ) - dplyr - to combine wp_coord, wp_cond and wp_adm.
wp <- left_join(
(left_join
(wp_coord,wp_cond,
by = c("row_id")
)
),
wp_adm, by = c("row_id"))3.2.5 Convert Well Known Text (WKT) Data to SF Data Frame
The “New Georeferenced Column” in wp_rds contains spatial data in a WKT format.
Two (2) steps to convert the WKT data format into an sf data frame.
3.2.5.1 derive new field :: “geometry”
Usage of the code chunk below :
st_as_sfc( ) - sf - to convert foreign geometry object `New Georeferenced Column` to an sfc object
wp$geometry = st_as_sfc(wp$`New Georeferenced Column`)3.2.5.2 convert to SF Data Frame
Usage of the code chunk below :
st_sf( ) - sf - to convert the tibble data frame into sf data frame with crs first set to WGS 84 (EPSG : 4326).
st_crs( ) - sf - to retrieve coordinate reference system from the object.
wp_sf<- st_sf(wp, crs = 4326)
st_crs(wp_sf)Coordinate Reference System:
User input: EPSG:4326
wkt:
GEOGCRS["WGS 84",
ENSEMBLE["World Geodetic System 1984 ensemble",
MEMBER["World Geodetic System 1984 (Transit)"],
MEMBER["World Geodetic System 1984 (G730)"],
MEMBER["World Geodetic System 1984 (G873)"],
MEMBER["World Geodetic System 1984 (G1150)"],
MEMBER["World Geodetic System 1984 (G1674)"],
MEMBER["World Geodetic System 1984 (G1762)"],
MEMBER["World Geodetic System 1984 (G2139)"],
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]],
ENSEMBLEACCURACY[2.0]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
CS[ellipsoidal,2],
AXIS["geodetic latitude (Lat)",north,
ORDER[1],
ANGLEUNIT["degree",0.0174532925199433]],
AXIS["geodetic longitude (Lon)",east,
ORDER[2],
ANGLEUNIT["degree",0.0174532925199433]],
USAGE[
SCOPE["Horizontal component of 3D system."],
AREA["World."],
BBOX[-90,-180,90,180]],
ID["EPSG",4326]]
3.3 Import Boundary Data
Usage of the code chunk below :
st_read( ) - sf - to read simple features.
select( ) - dplyr - to select “shapeName” variable.
bdy_nga <- st_read(dsn = "data/geospatial",
layer = "geoBoundaries-NGA-ADM2",
crs = 4326) %>%
select(shapeName)Reading layer `geoBoundaries-NGA-ADM2' from data source
`D:\jephOstan\ISSS624\class_project\project_2\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 774 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 2.668534 ymin: 4.273007 xmax: 14.67882 ymax: 13.89442
Geodetic CRS: WGS 84
problems(bdy_nga)3.3.1 Review Imported File
3.3.1.1 check for missing and duplicated data
skim(bdy_nga)Warning: Couldn't find skimmers for class: sfc_MULTIPOLYGON, sfc; No
user-defined `sfl` provided. Falling back to `character`.
| Name | bdy_nga |
| Number of rows | 774 |
| Number of columns | 2 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| shapeName | 0 | 1 | 3 | 18 | 0 | 768 | 0 |
| geometry | 0 | 1 | 878 | 33370 | 0 | 774 | 0 |
Remarks :
There is no missing data.
There 774 unique “geometry” but only 768 unique “shapeName”
- That’s mean there are 6 values of “shapeName” duplicated among the identified unique shapeNames.
3.3.1.2 list the unique “shapeName” associated with duplication
Usage of the code chunk below :
add_count( ) - dplyr - to count observations by group
filter( ) - dplyr - to retain shapeName that has count not equal to 1.
dupl_shapeName <- bdy_nga %>%
add_count(bdy_nga$shapeName) %>%
filter(n!=1) %>%
select(-n)
freq(dupl_shapeName$shapeName) n % val%
Bassa 2 16.7 16.7
Ifelodun 2 16.7 16.7
Irepodun 2 16.7 16.7
Nasarawa 2 16.7 16.7
Obi 2 16.7 16.7
Surulere 2 16.7 16.7
3.3.1.3 verify findings in section 3.3.1.2
Usage of the code chunk below :
tmap_mode( ) - tmap - to set tmap mode to static plotting or interactive.
tm_shape( ) - tmap - to specify the shape object.
tm_polygons( ) - tmap - to fill the polygons and draw the polygon borders.
tm_view( ) - tmap - to set the options for the interactive tmap viewer.
tm_fill( ) - tmap - to specify either which colour to be used or which data variable mapped to the colour palette.
tm_borders( ) - tmap - to draw the polygon borders.
tmap_style( ) - tmap - to set the tmap style.
tm_layout( ) - tmap - to set the layout of cartographic map.
tmap_mode("view")tmap mode set to interactive viewing
tm_shape(bdy_nga)+
tm_polygons()+
tm_view(set.zoom.limits = c(6,8))+
tm_shape(dupl_shapeName)+
tm_fill("shapeName",
n = 6,
style = "jenks")+
tm_borders(alpha = 0.5)+
tmap_style("albatross")+
tm_layout(main.title = "Distribution of Duplicated ShapeName",
main.title.size = 1.3,
main.title.position = "center")tmap style set to "albatross"
other available styles are: "white", "gray", "natural", "cobalt", "col_blind", "beaver", "bw", "classic", "watercolor"
Remarks :
The plot above indicates those duplicated water points are not within the same location.
tmap_mode("plot")tmap mode set to plotting
3.3.1.4 acquire State info for duplicated value
The State info to be combined with the duplicated “shapeName”. This will make all the shapeName unique.
dupl_shapeName %>%
mutate(st_centroid(dupl_shapeName$geometry, of_largest_polygon = FALSE))Simple feature collection with 12 features and 2 fields
Active geometry column: geometry
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 3.316459 ymin: 6.459038 xmax: 9.020704 ymax: 12.05035
Geodetic CRS: WGS 84
First 10 features:
shapeName bdy_nga$shapeName geometry
1 Bassa Bassa MULTIPOLYGON (((6.708541 7....
2 Bassa Bassa MULTIPOLYGON (((8.823522 10...
3 Ifelodun Ifelodun MULTIPOLYGON (((4.664107 8....
4 Ifelodun Ifelodun MULTIPOLYGON (((4.721977 7....
5 Irepodun Irepodun MULTIPOLYGON (((5.05493 8.0...
6 Irepodun Irepodun MULTIPOLYGON (((4.543349 7....
7 Nasarawa Nasarawa MULTIPOLYGON (((8.554589 11...
8 Nasarawa Nasarawa MULTIPOLYGON (((7.493228 8....
9 Obi Obi MULTIPOLYGON (((8.191919 6....
10 Obi Obi MULTIPOLYGON (((9.008576 8....
st_centroid(dupl_shapeName$geometry, of_largest_polygon = FALSE)
1 POINT (7.031827 7.791971)
2 POINT (8.782946 10.08015)
3 POINT (5.052235 8.544311)
4 POINT (4.636735 7.920948)
5 POINT (4.926215 8.169349)
6 POINT (4.498797 7.84861)
7 POINT (8.578262 12.00446)
8 POINT (7.760272 8.304034)
9 POINT (8.281026 7.022495)
10 POINT (8.734777 8.35534)
| lga | row_id | headquarter | state | iso3166code | dupl_shapeName_coord | lga_office_coord |
|---|---|---|---|---|---|---|
| Bassa | 94 | Oguma | Kogi | NG.KO.BA | 7.791971, 7.031827 | 7.80932, 6.74853 |
| Bassa | 95 | Bassa | Plateau | NG.PL.BA | 10.08015, 8.782946 | 10.11143, 8.71559 |
| Ifelodun | 304 | Share | Kwara | NG.KW.IF | 8.544311, 5.052235 | 8.5 5.0 |
| Ifelodun | 305 | Ikirun | Osun | NG.OS.ID | 7.920948, 4.636735 | 7.5 4.5 |
| Irepodun | 355 | Omu Aran | Kwara | NG.KW.IR | 8.169349, 4.926215 | 8.5 5.0 |
| Irepodun | 356 | Ilobu | Osun | NG.OS.IP | 7.84861, 4.498797 | 7.5 4.5 |
| Nasarawa | 519 | Bompai | Kano | NG.KN.NA | 12.00446, | |
| 8.578262 | 11.5 8.5 | |||||
| Nasarawa | 520 | Nasarawa | Nasarawa | NG.NA.NA | 8.304034, 7.760272 | 8.53477, 7.70153 |
| Obi | 546 | Obarike-Ito | Benue | NG.BE.OB | 7.022495, 8.281026 | 7.01129, 8.33118 |
| Obi | 547 | Obi | Nasarawa | NG.NA.OB | 8.35534, 8.734777 | 8.37944, 8.78561 |
| Surelere | 693 | Surelere | Lagos | NG.LA.SU | 6.493217, 3.346919 | 6.50048, 3.35488 |
| Surelere | 694 | Iresa-Adu | Oyo | NG.OY.SU | 8.088897, 4.393574 | 8.08459, 4.38538 |
3.4 Data Wrangling
3.4.1 Edit Duplicated Value :: “shapeName”
Two (2) Main Steps involved :
- Combine “shapeName” with the State name to make each of them unique.
- Replace the “shapeName” value according to each row index.5
5 Ong Z.R.J. (2022). Geospatial Analytics for Social Good - Understanding Nigeria Water functional and non-functional water point rate. https://jordan-isss624-geospatial.netlify.app/posts/geo/geospatial_exercise/#checking-of-duplicated-area-name
bdy_nga$shapeName[c(94,95,304,305,355,356,519,520,546,547,693,694)] <-
c("Bassa Kogi",
"Bassa Plateau",
"Ifelodun Kwara",
"Ifelodun Osun",
"Irepodun Kwara",
"Irepodun Osun",
"Nasarawa Kano",
"Nasarawa Nasarawa",
"Obi Nasarawa",
"Obi Benue",
"Surulere Lagos",
"Surulere Oyo")
bdy_nga$shapeName[c(94,95,304,305,355,356,519,520,546,547,693,694)] [1] "Bassa Kogi" "Bassa Plateau" "Ifelodun Kwara"
[4] "Ifelodun Osun" "Irepodun Kwara" "Irepodun Osun"
[7] "Nasarawa Kano" "Nasarawa Nasarawa" "Obi Nasarawa"
[10] "Obi Benue" "Surulere Lagos" "Surulere Oyo"
3.4.1.1 validate edited value :: “shapeName”
dupl_shapeName_val <- bdy_nga %>%
add_count(bdy_nga$shapeName) %>%
filter(n!=1) %>%
select(-n)
dupl_shapeName_valSimple feature collection with 0 features and 2 fields
Bounding box: xmin: NA ymin: NA xmax: NA ymax: NA
Geodetic CRS: WGS 84
[1] shapeName bdy_nga$shapeName geometry
<0 rows> (or 0-length row.names)
3.4.2 Perform Point-in-Polygon Overlay
Combine both attribute and boundary of the water points into a simple feature object.
3.4.2.1 join objects :: wp_sf and bdy_nga
Usage of the code chunk below :
st_join( ) - sf - to join sf-class objects based on geometry, namely, wp_sf and bdy_nga.
wp_joined <- st_join(x = wp_sf,
y = bdy_nga,
join = st_intersects,
left = TRUE)3.4.2.2 save and read RDS File :: wp_joined
write_rds(wp_joined,"data/geodata/wp_joined.rds",compress = "xz")
wp_joined <- read_rds("data/geodata/wp_joined.rds")3.4.2.3 inspect joined file :: wp_joined
-- retrieve geometry summary
st_geometry(wp_joined)Geometry set for 95008 features
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 2.707441 ymin: 4.301812 xmax: 14.21828 ymax: 13.86568
Geodetic CRS: WGS 84
First 5 geometries:
POINT (10.47318 10.60104)
POINT (6.95009 6.78599)
POINT (7.615451 6.799595)
POINT (7.30539 6.30817)
POINT (10.44625 10.50681)
-- assess uniqueness of Water Point
wp_joined %>% janitor::get_dupes(shapeName,lat_lon_deg)No duplicate combinations found of: shapeName, lat_lon_deg
Simple feature collection with 0 features and 24 fields
Bounding box: xmin: NA ymin: NA xmax: NA ymax: NA
Geodetic CRS: WGS 84
# A tibble: 0 × 25
# … with 25 variables: shapeName <chr>, lat_lon_deg <chr>, dupe_count <int>,
# row_id <dbl>, lat_deg <dbl>, lon_deg <dbl>, New Georeferenced Column <chr>,
# water_source <chr>, water_source_clean <chr>, water_source_category <chr>,
# water_tech_clean <chr>, water_tech_category <chr>, status_clean <chr>,
# status <chr>, clean_adm1 <chr>, clean_adm2 <chr>,
# water_point_population <dbl>, local_population_1km <dbl>,
# crucialness_score <dbl>, pressure_score <dbl>, usage_capacity <dbl>, …
Remarks :
Each water point observation is unique as there are no duplicate combination of “shapeName” together with “lat_lon_deg”.
-- determine reference point :: “shapeName” or “clean_adm2”
wp_reference <- (wp_joined$shapeName == wp_joined$clean_adm2)
freq(wp_reference) n % val%
FALSE 29856 31.4 31.4
TRUE 65123 68.5 68.6
NA 29 0.0 NA
Remarks :
There are 29,713 “FALSE”, which is more than 30% of LGA names mismatched between “shapeName” and “clean_adm2”.
Since the geoBoundaries data is collected from government-published and reliable internet sources.6
- Hence, the “shapeName” variable to be used throughout this study.
The 29 NA’s are 29 water points located beyond the LGA boundary, as shown in the plot below.
6 Daniel et. al (2020) geoBoundaries: A global database of political administrative boundaries. PlosOne. https://doi.org/10.1371/journal.pone.0231866
tmap_mode("view")tmap mode set to interactive viewing
tm_shape(bdy_nga) +
tm_polygons() +
tm_view(set.zoom.limits = c(5.5, 12))+
tm_shape(filter(wp_joined, is.na(wp_reference))) +
tm_dots(size=0.1,
col="red")tmap_mode("plot")tmap mode set to plotting
3.4.3 Remove Water Point Outside LGA Boundary
wp_joined1 <- wp_joined %>% filter(shapeName == clean_adm2 | shapeName != clean_adm2)3.4.4 Replace “NA” with “Unknown”
3.4.4.1 identify variable with missing value
skim(wp_joined1)Warning: Couldn't find skimmers for class: sfc_POINT, sfc; No user-defined `sfl`
provided. Falling back to `character`.
| Name | wp_joined1 |
| Number of rows | 94979 |
| Number of columns | 24 |
| _______________________ | |
| Column type frequency: | |
| character | 13 |
| logical | 1 |
| numeric | 10 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| New Georeferenced Column | 0 | 1.00 | 11 | 45 | 0 | 94979 | 0 |
| lat_lon_deg | 0 | 1.00 | 8 | 42 | 0 | 94979 | 0 |
| water_source | 0 | 1.00 | 3 | 32 | 0 | 16 | 0 |
| water_source_clean | 302 | 1.00 | 8 | 22 | 0 | 5 | 0 |
| water_source_category | 302 | 1.00 | 4 | 11 | 0 | 3 | 0 |
| water_tech_clean | 10054 | 0.89 | 8 | 26 | 0 | 11 | 0 |
| water_tech_category | 10054 | 0.89 | 8 | 15 | 0 | 4 | 0 |
| status_clean | 10648 | 0.89 | 9 | 32 | 0 | 8 | 0 |
| status | 10648 | 0.89 | 14 | 156 | 0 | 834 | 0 |
| clean_adm1 | 0 | 1.00 | 3 | 25 | 0 | 37 | 0 |
| clean_adm2 | 0 | 1.00 | 3 | 19 | 0 | 753 | 0 |
| geometry | 0 | 1.00 | 7 | 37 | 0 | 94979 | 0 |
| shapeName | 0 | 1.00 | 3 | 18 | 0 | 761 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| is_urban | 0 | 1 | 0.21 | FAL: 75423, TRU: 19556 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| row_id | 0 | 1.00 | 199938.21 | 189720.69 | 10732.00 | 52622.50 | 86939.00 | 323664.50 | 681838.00 | ▇▃▂▂▁ |
| lat_deg | 0 | 1.00 | 9.33 | 2.48 | 4.30 | 7.36 | 9.09 | 11.83 | 13.87 | ▃▇▅▅▆ |
| lon_deg | 0 | 1.00 | 7.50 | 2.25 | 2.71 | 5.52 | 7.89 | 9.08 | 14.22 | ▃▃▇▃▁ |
| water_point_population | 537 | 0.99 | 1246.24 | 4027.60 | 0.00 | 117.00 | 413.00 | 1169.00 | 384595.00 | ▇▁▁▁▁ |
| local_population_1km | 537 | 0.99 | 3723.38 | 7418.08 | 0.00 | 597.00 | 1756.00 | 4393.00 | 384595.00 | ▇▁▁▁▁ |
| crucialness_score | 6873 | 0.93 | 0.41 | 0.34 | 0.00 | 0.13 | 0.30 | 0.63 | 1.00 | ▇▅▃▁▅ |
| pressure_score | 6873 | 0.93 | 3.21 | 9.04 | 0.00 | 0.40 | 1.18 | 3.10 | 776.97 | ▇▁▁▁▁ |
| usage_capacity | 0 | 1.00 | 488.63 | 310.95 | 50.00 | 300.00 | 300.00 | 1000.00 | 1000.00 | ▁▇▁▁▃ |
| staleness_score | 0 | 1.00 | 44.94 | 6.29 | 23.13 | 41.49 | 42.87 | 44.34 | 99.00 | ▁▇▁▁▁ |
| rehab_priority | 53089 | 0.44 | 1545.55 | 5244.05 | 0.00 | 136.00 | 522.00 | 1527.00 | 384595.00 | ▇▁▁▁▁ |
3.4.4.2 replace “NA” with “Unknown”
mutate( ) - dplyr - to run replace_na( ) function.
- replace_na( ) - tidyr - to replace NAs with “unknown”.
wp_joined1 <- wp_joined1 %>%
mutate(status_clean = replace_na(status_clean, "Unknown")) %>%
mutate(water_tech_category = replace_na(water_tech_category, "Unknown")) %>%
mutate(status = replace_na(status, "Unknown")) %>%
mutate(water_point_population = replace_na(water_point_population, 0)) %>%
mutate(local_population_1km = replace_na(local_population_1km, 0)) %>%
mutate(crucialness_score = replace_na(crucialness_score, 0)) %>%
mutate(pressure_score = replace_na(pressure_score, 0))unique(wp_joined1$status_clean)[1] "Non-Functional" "Unknown"
[3] "Functional" "Functional but needs repair"
[5] "Abandoned/Decommissioned" "Functional but not in use"
[7] "Non-Functional due to dry season" "Abandoned"
[9] "Non functional due to dry season"
3.4.5 Standardise Value
3.4.5.1 combine value :: “status_clean”
wp_joined1$status_clean[wp_joined1$status_clean == "Non functional due to dry season"] <- "Non-Functional due to dry season"
wp_joined1$status_clean[wp_joined1$status_clean == "Abandoned"] <- "Abandoned/Decommissioned"3.4.5.2 review “status_clean”
unique(wp_joined1$status_clean)[1] "Non-Functional" "Unknown"
[3] "Functional" "Functional but needs repair"
[5] "Abandoned/Decommissioned" "Functional but not in use"
[7] "Non-Functional due to dry season"
3.4.4.3 read RDS file :: wp_joined1
wp_joined1 <- read_rds("data/geodata/wp_joined1.rds")3.4.5 Extract Water Point for New Table :: wp_nga
3.4.5.1 extract functional water point
wpt_functional <- wp_joined1 %>%
filter(status_clean %in%
c("Functional",
"Functional but not in use",
"Functional but needs repair"))-- save and read RDS file :: wpt_functional
write_rds(wpt_functional,"data/geodata/wpt_functional.rds",compress = "xz")
wpt_functional <- read_rds("data/geodata/wpt_functional.rds")3.4.5.2 inspect variable and value
-- reveal value :: “status_clean”
freq(wpt_functional$status_clean) n % val%
Functional 45883 88.0 88.0
Functional but needs repair 4579 8.8 8.8
Functional but not in use 1686 3.2 3.2
length(wpt_functional$row_id)[1] 52148
length(wpt_functional$row_id)/length(wp_joined1$row_id)*100[1] 54.88801
Remarks :
The total functional water points is 52,148 which is about 54.89% of total water points.
-- reveal value :: “usage_capacity”
freq(wpt_functional$usage_capacity) n % val%
50 2 0.0 0.0
250 75 0.1 0.1
300 38064 73.0 73.0
1000 14007 26.9 26.9
-- reveal value “usage_capacity” by “status_clean”
wpt_functional %>% count(status_clean, usage_capacity, sort = TRUE)Simple feature collection with 10 features and 3 fields
Geometry type: MULTIPOINT
Dimension: XY
Bounding box: xmin: 2.711632 ymin: 4.302938 xmax: 13.5022 ymax: 13.86331
Geodetic CRS: WGS 84
# A tibble: 10 × 4
status_clean usage_capacity n geometry
* <chr> <dbl> <int> <MULTIPOINT [°]>
1 Functional 300 33687 ((3.064921 7.994882), (3.06…
2 Functional 1000 12124 ((3.080189 7.99252), (3.085…
3 Functional but needs repair 300 3306 ((3.340832 8.037962), (3.34…
4 Functional but needs repair 1000 1271 ((3.373801 7.992051), (3.33…
5 Functional but not in use 300 1071 ((3.046639 8.017765), (2.88…
6 Functional but not in use 1000 612 ((3.088655 8.005296), (3.05…
7 Functional 250 70 ((3.355785 6.498105), (3.67…
8 Functional but not in use 250 3 ((8.032945 6.878883), (7.00…
9 Functional 50 2 ((7.027967 4.765731), (8.92…
10 Functional but needs repair 250 2 ((6.465915 5.826699), (7.93…
-- reveal value :: “crucialness_score”
summary(wpt_functional$crucialness_score == 1) Mode FALSE TRUE
logical 47006 5142
-- determine the total population within 1 km by “crucialness_score”
freq(wpt_functional$crucialness_score == 1) n % val%
FALSE 47006 90.1 90.1
TRUE 5142 9.9 9.9
sum(wpt_functional[wpt_functional$crucialness_score == 1,]$local_population_1km)[1] 11252574
Remarks :
Given the “crucialness_score” is a ratio of current water point users to the total population within 1 km radius thereof :
Currently, 5,142 water points serve the population within a 1 km radius at its capacity limit.
The usage capacity may need to be increased to sustain or improve the growth or development rate within 1km of these water points.
Should the population within 1 km therefrom grow above 11,252,574, there may be multiple repercussions in resources management, urbanisation progress, local food and beverage consumption, local commodity prices, or worst case scenario would be the stability of civil society.
summary(wpt_functional$pressure_score > 1) Mode FALSE TRUE
logical 27469 24679
length(wpt_functional$pressure_score)[1] 52148
24679/52148*100 #percentage of functional waterpoints over their usage limit[1] 47.32492
Remarks :
Given the “pressure_score” is the ratio of the current water point users to the usage capacity thereof :
- 24,679, or 47.32% of functional water points, are currently over their limit of usage capacity.
3.4.5.3 Exploratory Data Analysis (EDA) :: wpt_functional
-- plot “status_clean”
ggplot(data = wpt_functional,
aes(fct_infreq(status_clean), fill=status_clean))+
geom_bar()+
geom_text(
aes(label=after_stat(count)),
stat='count',
nudge_x=-0.25,
vjust=-0.2)+
geom_text(
aes(label= scales::percent(signif(after_stat(count/sum(count))))),
stat='count',
nudge_x=0.25,
vjust=-0.2)+
scale_x_discrete(guide = guide_axis(n.dodge = 2))+
guides(fill=guide_legend (title = 'Status'))
-- plot “water_tech_category”
ggplot(data=wpt_functional,
aes(x=fct_infreq(
water_tech_category)))+
geom_bar(aes(
fill = water_tech_category),
width = 0.8)+
geom_text(aes(
label = ..count..),
stat = "count",
vjust=-0.2,
size = 3.5,
color = "black")+
scale_x_discrete(guide = guide_axis(n.dodge = 2))+
guides(fill=guide_legend (title = 'Water Tech Deployed'))Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(count)` instead.

-- plot “water_source_clean”
ggplot(data=wpt_functional,
aes(x=fct_infreq(
water_source_clean)))+
geom_bar(aes(
fill = water_source_clean),
width = 0.8)+
geom_text(aes(
label = ..count..),
stat = "count",
vjust=-0.2,
size = 3.5,
color = "black")+
scale_x_discrete(guide = guide_axis(
n.dodge = 2))+
guides(fill=guide_legend (
title = 'Source of Water Supply'))
3.4.5.4 add attribute to new data table
wp_nga <- bdy_nga %>%
mutate(`total_wp` = lengths(
st_intersects(bdy_nga, wp_joined1))) %>%
mutate(`wp_functional` = lengths(
st_intersects(bdy_nga, wpt_functional))) %>%
mutate(`pct_functional` = (`wp_functional`/`total_wp`*100))-- replace “NaN” with 0
wp_nga <- wp_nga %>%
mutate(`pct_functional` = replace_na(pct_functional, 0))
summary(wp_nga) shapeName geometry total_wp wp_functional
Length:774 MULTIPOLYGON :774 Min. : 0.0 Min. : 0.00
Class :character epsg:4326 : 0 1st Qu.: 45.0 1st Qu.: 17.00
Mode :character +proj=long...: 0 Median : 96.0 Median : 45.50
Mean :122.7 Mean : 67.36
3rd Qu.:168.8 3rd Qu.: 87.75
Max. :894.0 Max. :752.00
pct_functional
Min. : 0.00
1st Qu.: 32.61
Median : 47.41
Mean : 49.84
3rd Qu.: 66.99
Max. :100.00
3.4.5.5 extract non-functional water point
wpt_nonFunctional <- wp_joined1 %>%
filter(status_clean %in%
c("Abandoned/Decommissioned",
"Non-Functional",
"Non-Functional due to dry season"))-- save and read RDS file :: wpt_nonFuntional
write_rds(wpt_nonFunctional,"data/geodata/wpt_nonFunctional.rds",compress = "xz")
wpt_nonFunctional <- read_rds("data/geodata/wpt_nonFunctional.rds")3.4.5.6 inspect variable and value
-- reveal value :: “status_clean”
freq(wpt_nonFunctional$status_clean) n % val%
Abandoned/Decommissioned 234 0.7 0.7
Non-Functional 29385 91.7 91.7
Non-Functional due to dry season 2410 7.5 7.5
length(wpt_nonFunctional$row_id)[1] 32029
length(wpt_nonFunctional$row_id)/length(wp_joined1$row_id)*100[1] 33.7119
Remarks :
There are 32,204, which is about 33.9% out of total water points.
-- reveal value :: “usage_capacity”
freq(wpt_nonFunctional$usage_capacity) n % val%
250 41 0.1 0.1
300 20586 64.3 64.3
1000 11402 35.6 35.6
-- reveal value “usage_capacity” by “status_clean”
wpt_nonFunctional %>% count(status_clean, usage_capacity, sort = TRUE)Simple feature collection with 7 features and 3 fields
Geometry type: MULTIPOINT
Dimension: XY
Bounding box: xmin: 2.707441 ymin: 4.301812 xmax: 13.4192 ymax: 13.86567
Geodetic CRS: WGS 84
# A tibble: 7 × 4
status_clean usage_capac…¹ n geometry
* <chr> <dbl> <int> <MULTIPOINT [°]>
1 Non-Functional 300 18492 ((3.064526 7.994448), (3…
2 Non-Functional 1000 10852 ((3.083391 7.993231), (3…
3 Non-Functional due to dry season 300 2012 ((3.051752 7.984243), (3…
4 Non-Functional due to dry season 1000 398 ((3.056661 7.985808), (3…
5 Abandoned/Decommissioned 1000 152 ((4.713438 7.891137), (4…
6 Abandoned/Decommissioned 300 82 ((3.199483 8.912549), (2…
7 Non-Functional 250 41 ((3.976195 6.582998), (3…
# … with abbreviated variable name ¹usage_capacity
-- reveal “crucialness_score”
sum(wpt_nonFunctional$local_population_1km)[1] 93999535
sum(wpt_nonFunctional$water_point_population)[1] 46255888
Remarks :
Given the “crucialness_score” is a ratio of current water point users to the total population within a 1 km radius thereof , in the context of non-functional :
- Currently, out of 95,013,340 residents within a 1 km radius, there are 46,710,127 of them is affected by these non-functional water point.
3.4.5.7 EDA :: wpt_nonFunctional
-- plot “status_clean”
ggplot(data = wpt_nonFunctional,
aes(fct_infreq(status_clean),
fill=status_clean))+
geom_bar()+
geom_text(
aes(label=after_stat(count)),
stat='count',
nudge_x=-0.25,
vjust=-0.2)+
geom_text(
aes(label= scales::percent(
signif(
after_stat(
count/sum(count)
)))),
stat='count',
nudge_x=0.25,
vjust=-0.2)+
scale_x_discrete(
guide = guide_axis(
n.dodge = 2))+
guides(fill=guide_legend (
title = 'Status'))
-- plot “water_tech_category”
ggplot(data=wpt_nonFunctional,
aes(fct_infreq(
water_tech_category)))+
geom_bar(aes(
fill = water_tech_category),
width = 0.8)+
geom_text(aes(
label = ..count..),
stat = "count",
vjust=-0.2,
size = 3.5,
color = "black")+
scale_x_discrete(guide = guide_axis(
n.dodge = 2))+
guides(fill=guide_legend (
title = 'Water Tech Deployed'))
-- plot “water_source_clean”
ggplot(data=wpt_nonFunctional,
aes(fct_infreq(
water_source_clean)))+
geom_bar(aes(
fill = water_source_clean),
width = 0.8)+
geom_text(aes(
label = ..count..),
stat = "count",
vjust=-0.2,
size = 3.5,
color = "black")+
scale_x_discrete(guide = guide_axis(
n.dodge = 2))+
guides(fill=guide_legend (
title = 'Source of Water Supply'))
3.4.5.8 add wpt_nonFunctional to wp_nga
wp_nga <- wp_nga %>%
mutate(`wp_nonFunctional` = lengths(
st_intersects(bdy_nga, wpt_nonFunctional))) %>%
mutate(`pct_nonFunctional` = (`wp_nonFunctional`/`total_wp`*100))-- replace “NaN” with 0
wp_nga <- wp_nga %>%
mutate(`pct_nonFunctional` = replace_na(pct_nonFunctional, 0))
summary(wp_nga) shapeName geometry total_wp wp_functional
Length:774 MULTIPOLYGON :774 Min. : 0.0 Min. : 0.00
Class :character epsg:4326 : 0 1st Qu.: 45.0 1st Qu.: 17.00
Mode :character +proj=long...: 0 Median : 96.0 Median : 45.50
Mean :122.7 Mean : 67.36
3rd Qu.:168.8 3rd Qu.: 87.75
Max. :894.0 Max. :752.00
pct_functional wp_nonFunctional pct_nonFunctional
Min. : 0.00 Min. : 0.00 Min. : 0.00
1st Qu.: 32.61 1st Qu.: 12.00 1st Qu.: 20.77
Median : 47.41 Median : 33.50 Median : 34.89
Mean : 49.84 Mean : 41.37 Mean : 35.58
3rd Qu.: 66.99 3rd Qu.: 60.00 3rd Qu.: 50.00
Max. :100.00 Max. :278.00 Max. :100.00
3.4.5.9 extract unknown water point
wpt_unknown <- wp_joined1 %>%
filter(status_clean == "Unknown")-- save and read RDS file :: wpt_unknown
write_rds(wpt_unknown,"data/geodata/wpt_unknown.rds")
wpt_unknown <- read_rds("data/geodata/wpt_unknown.rds")3.4.5.10 inspect variable and value
-- reveal value :: “status_clean”
length(wpt_unknown$row_id)[1] 10656
length(wpt_unknown$row_id)/length(wp_joined1$row_id)*100[1] 11.2159
Remarks :
There are 10,656 water points with unknown status, about 11.22% of total water points.
-- determine affected population
sum(wpt_unknown$water_point_population)[1] 18831488
sum(wpt_unknown$local_population_1km)[1] 31418651
3.4.5.11 add wpt_unknown to wp_nga
wp_nga <- wp_nga %>%
mutate(`wp_unknown` = lengths(
st_intersects(bdy_nga, wpt_unknown))) %>%
mutate(`pct_unknown` = (`wp_unknown`/`total_wp`*100))-- replace “NaN” with 0
wp_nga <- wp_nga %>%
mutate(`pct_unknown` = replace_na(pct_unknown, 0))
summary(wp_nga) shapeName geometry total_wp wp_functional
Length:774 MULTIPOLYGON :774 Min. : 0.0 Min. : 0.00
Class :character epsg:4326 : 0 1st Qu.: 45.0 1st Qu.: 17.00
Mode :character +proj=long...: 0 Median : 96.0 Median : 45.50
Mean :122.7 Mean : 67.36
3rd Qu.:168.8 3rd Qu.: 87.75
Max. :894.0 Max. :752.00
pct_functional wp_nonFunctional pct_nonFunctional wp_unknown
Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
1st Qu.: 32.61 1st Qu.: 12.00 1st Qu.: 20.77 1st Qu.: 0.00
Median : 47.41 Median : 33.50 Median : 34.89 Median : 0.00
Mean : 49.84 Mean : 41.37 Mean : 35.58 Mean : 13.76
3rd Qu.: 66.99 3rd Qu.: 60.00 3rd Qu.: 50.00 3rd Qu.: 17.75
Max. :100.00 Max. :278.00 Max. :100.00 Max. :219.00
pct_unknown
Min. : 0.00
1st Qu.: 0.00
Median : 0.00
Mean : 12.55
3rd Qu.: 20.83
Max. :100.00
3.4.5.12 visualise distribution :: “status_clean”
Usage of the code chunk below :
qtm( ) - tmap - to plot a thematic map quickly.
tmap_arrange( ) - tmap - to arrange small multiples in grid layout.
total_wp <- qtm(wp_nga,"total_wp")+
tm_layout(legend.height = 0.3, legend.width = 0.5)
wp_functional <- qtm(wp_nga,"wp_functional")+
tm_layout(legend.height = 0.3, legend.width = 0.5)
wp_nonFunctional <- qtm(wp_nga,"wp_nonFunctional")+
tm_layout(legend.height = 0.3, legend.width = 0.5)
wp_unknown <- qtm(wp_nga,"wp_unknown")+
tm_layout(legend.height = 0.3, legend.width = 0.5)
tmap_arrange(total_wp, wp_functional, wp_nonFunctional, wp_unknown, asp=0, ncol = 2, nrow = 2, widths = 5, heights = 10, sync = TRUE)
3.4.5.13 extract “water_tech_category” to wp_nga
freq(wp_joined1$water_tech_category, sort = "dec") n % val%
Hand Pump 58755 61.8 61.8
Mechanized Pump 25644 27.0 27.0
Unknown 10055 10.6 10.6
Tapstand 553 0.6 0.6
Rope and Bucket 1 0.0 0.0
Remarks :
Only “Hand Pump”, “Mechanized Pump”, and “Tapstand” are to be extracted for further analysis as the rest are either less than 0.5% or “Unknown”.
wtc_hPump <- wp_joined1 %>%
filter(water_tech_category %in%
"Hand Pump")
wtc_mPump <- wp_joined1 %>%
filter(water_tech_category %in%
"Mechanized Pump")
wtc_tStand <- wp_joined1 %>%
filter(water_tech_category %in%
"Tapstand")
wp_nga <- wp_nga %>%
mutate(`total_handPump` = lengths(
st_intersects(bdy_nga, wtc_hPump)
)) %>%
mutate(`total_mechPump` = lengths(
st_intersects(bdy_nga, wtc_mPump)
)) %>%
mutate(`total_tapStand` = lengths(
st_intersects(bdy_nga, wtc_tStand)
)) %>%
mutate(`pct_handPump` = (`total_handPump`/`total_wp`*100)) %>%
mutate(`pct_mechPump` = (`total_mechPump`/`total_wp`*100)) %>%
mutate(`pct_tapStand` = (`total_tapStand`/`total_wp`*100))-- replace “NaN” with 0
wp_nga <- wp_nga %>%
mutate(`pct_handPump` = replace_na(pct_handPump, 0)) %>%
mutate(`pct_mechPump` = replace_na(pct_mechPump, 0)) %>%
mutate(`pct_tapStand` = replace_na(pct_tapStand, 0))
summary(wp_nga) shapeName geometry total_wp wp_functional
Length:774 MULTIPOLYGON :774 Min. : 0.0 Min. : 0.00
Class :character epsg:4326 : 0 1st Qu.: 45.0 1st Qu.: 17.00
Mode :character +proj=long...: 0 Median : 96.0 Median : 45.50
Mean :122.7 Mean : 67.36
3rd Qu.:168.8 3rd Qu.: 87.75
Max. :894.0 Max. :752.00
pct_functional wp_nonFunctional pct_nonFunctional wp_unknown
Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
1st Qu.: 32.61 1st Qu.: 12.00 1st Qu.: 20.77 1st Qu.: 0.00
Median : 47.41 Median : 33.50 Median : 34.89 Median : 0.00
Mean : 49.84 Mean : 41.37 Mean : 35.58 Mean : 13.76
3rd Qu.: 66.99 3rd Qu.: 60.00 3rd Qu.: 50.00 3rd Qu.: 17.75
Max. :100.00 Max. :278.00 Max. :100.00 Max. :219.00
pct_unknown total_handPump total_mechPump total_tapStand
Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.0000
1st Qu.: 0.00 1st Qu.: 6.00 1st Qu.: 11.00 1st Qu.: 0.0000
Median : 0.00 Median : 47.00 Median : 25.50 Median : 0.0000
Mean : 12.55 Mean : 75.89 Mean : 33.12 Mean : 0.7145
3rd Qu.: 20.83 3rd Qu.:111.00 3rd Qu.: 46.00 3rd Qu.: 0.0000
Max. :100.00 Max. :764.00 Max. :245.00 Max. :42.0000
pct_handPump pct_mechPump pct_tapStand
Min. : 0.00 Min. : 0.00 Min. : 0.0000
1st Qu.: 16.70 1st Qu.: 12.20 1st Qu.: 0.0000
Median : 50.99 Median : 31.27 Median : 0.0000
Mean : 48.73 Mean : 37.54 Mean : 0.5794
3rd Qu.: 77.78 3rd Qu.: 57.71 3rd Qu.: 0.0000
Max. :100.00 Max. :100.00 Max. :32.8947
3.4.5.14 visualise wp_nga distribution :: “water_tech_category”
handPump <- tm_shape(bdy_nga)+
tm_polygons(alpha = 0.1) +
tm_shape(wp_nga) +
tm_dots(col = "pct_handPump",
border.col = "gray60",
border.lwd = 0.5) +
tm_view(set.zoom.limits = c(5,9))
mechPump <- tm_shape(bdy_nga)+
tm_polygons(alpha = 0.1) +
tm_shape(wp_nga) +
tm_dots(col = "pct_mechPump",
border.col = "gray60",
border.lwd = 0.5) +
tm_view(set.zoom.limits = c(5,9))
tapStand <- tm_shape(bdy_nga)+
tm_polygons(alpha = 0.1) +
tm_shape(wp_nga) +
tm_dots(col = "pct_tapStand",
border.col = "gray60",
border.lwd = 0.5) +
tm_view(set.zoom.limits = c(5,9))
tmap_arrange(handPump, mechPump, tapStand,
asp=1,
ncol=2,
sync = TRUE)
3.4.5.15 extract “usage_capacity” to wp_nga
freq(wp_joined1$usage_capacity, sort = "dec") n % val%
300 68789 72.4 72.4
1000 25644 27.0 27.0
250 573 0.6 0.6
50 2 0.0 0.0
Remarks :
Only “300”, “1000”, and “250” are to be extracted for further analysis as the rest are either less than 0.5% or “Unknown”.
But, “50” will be included in the new variable “total_ucN1000” as part of the none ‘1000’ “usage_capacity” value.
uCap_300 <- wp_joined1 %>%
filter(usage_capacity %in%
"300")
uCap_1000 <- wp_joined1 %>%
filter(usage_capacity %in%
"1000")
uCap_250 <- wp_joined1 %>%
filter(usage_capacity %in%
"250")
uCap_50 <- wp_joined1 %>%
filter(usage_capacity %in%
"50")
wp_nga <- wp_nga %>%
mutate(`total_uc300` = lengths(
st_intersects(bdy_nga, uCap_300)
)) %>%
mutate(`total_uc1000` = lengths(
st_intersects(bdy_nga, uCap_1000)
)) %>%
mutate(`total_uc250` = lengths(
st_intersects(bdy_nga, uCap_250)
)) %>%
mutate(`total_uc50` = lengths(
st_intersects(bdy_nga, uCap_50)
)) %>%
mutate(`total_ucN1000` = ((lengths(
st_intersects(
bdy_nga, uCap_300))) + (lengths(
st_intersects(
bdy_nga, uCap_250))) + (lengths(
st_intersects(
bdy_nga, uCap_50))))
)%>%
mutate(`pct_ucN1000` = (`total_ucN1000`/`total_wp`*100)) %>%
mutate(`pct_uc300` = (`total_uc300`/`total_wp`*100)) %>%
mutate(`pct_uc1000` = (`total_uc1000`/`total_wp`*100)) %>%
mutate(`pct_uc250` = (`total_uc250`/`total_wp`*100))-- replace “NaN” with 0
wp_nga <- wp_nga %>%
mutate(`pct_ucN1000` = replace_na(pct_ucN1000, 0)) %>%
mutate(`pct_uc300` = replace_na(pct_uc300, 0)) %>%
mutate(`pct_uc1000` = replace_na(pct_uc1000, 0)) %>%
mutate(`pct_uc250` = replace_na(pct_uc250, 0))
summary(wp_nga) shapeName geometry total_wp wp_functional
Length:774 MULTIPOLYGON :774 Min. : 0.0 Min. : 0.00
Class :character epsg:4326 : 0 1st Qu.: 45.0 1st Qu.: 17.00
Mode :character +proj=long...: 0 Median : 96.0 Median : 45.50
Mean :122.7 Mean : 67.36
3rd Qu.:168.8 3rd Qu.: 87.75
Max. :894.0 Max. :752.00
pct_functional wp_nonFunctional pct_nonFunctional wp_unknown
Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
1st Qu.: 32.61 1st Qu.: 12.00 1st Qu.: 20.77 1st Qu.: 0.00
Median : 47.41 Median : 33.50 Median : 34.89 Median : 0.00
Mean : 49.84 Mean : 41.37 Mean : 35.58 Mean : 13.76
3rd Qu.: 66.99 3rd Qu.: 60.00 3rd Qu.: 50.00 3rd Qu.: 17.75
Max. :100.00 Max. :278.00 Max. :100.00 Max. :219.00
pct_unknown total_handPump total_mechPump total_tapStand
Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.0000
1st Qu.: 0.00 1st Qu.: 6.00 1st Qu.: 11.00 1st Qu.: 0.0000
Median : 0.00 Median : 47.00 Median : 25.50 Median : 0.0000
Mean : 12.55 Mean : 75.89 Mean : 33.12 Mean : 0.7145
3rd Qu.: 20.83 3rd Qu.:111.00 3rd Qu.: 46.00 3rd Qu.: 0.0000
Max. :100.00 Max. :764.00 Max. :245.00 Max. :42.0000
pct_handPump pct_mechPump pct_tapStand total_uc300
Min. : 0.00 Min. : 0.00 Min. : 0.0000 Min. : 0.00
1st Qu.: 16.70 1st Qu.: 12.20 1st Qu.: 0.0000 1st Qu.: 15.25
Median : 50.99 Median : 31.27 Median : 0.0000 Median : 59.00
Mean : 48.73 Mean : 37.54 Mean : 0.5794 Mean : 88.85
3rd Qu.: 77.78 3rd Qu.: 57.71 3rd Qu.: 0.0000 3rd Qu.:126.75
Max. :100.00 Max. :100.00 Max. :32.8947 Max. :767.00
total_uc1000 total_uc250 total_uc50 total_ucN1000
Min. : 0.00 Min. : 0.0000 Min. :0.000000 Min. : 0.00
1st Qu.: 11.00 1st Qu.: 0.0000 1st Qu.:0.000000 1st Qu.: 16.00
Median : 25.50 Median : 0.0000 Median :0.000000 Median : 60.00
Mean : 33.12 Mean : 0.7403 Mean :0.002584 Mean : 89.59
3rd Qu.: 46.00 3rd Qu.: 0.0000 3rd Qu.:0.000000 3rd Qu.:127.75
Max. :245.00 Max. :42.0000 Max. :1.000000 Max. :767.00
pct_ucN1000 pct_uc300 pct_uc1000 pct_uc250
Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.0000
1st Qu.: 39.68 1st Qu.: 38.67 1st Qu.: 12.20 1st Qu.: 0.0000
Median : 67.03 Median : 65.91 Median : 31.27 Median : 0.0000
Mean : 60.78 Mean : 60.17 Mean : 37.54 Mean : 0.6114
3rd Qu.: 87.35 3rd Qu.: 87.02 3rd Qu.: 57.71 3rd Qu.: 0.0000
Max. :100.00 Max. :100.00 Max. :100.00 Max. :32.8947
3.4.5.16 visualise wp_nga distribution :: “usage_capacity”
uc300 <- tm_shape(bdy_nga)+
tm_polygons(alpha = 0.1) +
tm_shape(wp_nga) +
tm_dots(col = "pct_uc300",
border.col = "gray60",
border.lwd = 0.5) +
tm_view(set.zoom.limits = c(5,9))
uc1000 <- tm_shape(bdy_nga)+
tm_polygons(alpha = 0.1) +
tm_shape(wp_nga) +
tm_dots(col = "pct_uc1000",
border.col = "gray60",
border.lwd = 0.5) +
tm_view(set.zoom.limits = c(5,9))
uc250 <- tm_shape(bdy_nga)+
tm_polygons(alpha = 0.1) +
tm_shape(wp_nga) +
tm_dots(col = "pct_uc250",
border.col = "gray60",
border.lwd = 0.5) +
tm_view(set.zoom.limits = c(5,9))
tmap_arrange(uc300, uc1000, uc250,
asp=1,
ncol=2,
sync = TRUE)
3.4.5.17 extract “is_urban” to wp_nga
urban_1 <- wp_joined1 %>%
filter(is_urban %in%
"TRUE")
urban_0 <- wp_joined1 %>%
filter(is_urban %in%
"FALSE")
wp_nga <- wp_nga %>%
mutate(`total_urban1` = lengths(
st_intersects(bdy_nga, urban_1)
)) %>%
mutate(`total_urban0` = lengths(
st_intersects(bdy_nga, urban_0)
)) %>%
mutate(`pct_urban1` = (`total_urban1`/`total_wp`*100)) %>%
mutate(`pct_urban0` = (`total_urban0`/`total_wp`*100))-- replace “NaN” with 0
wp_nga <- wp_nga %>%
mutate(`pct_urban1` = replace_na(pct_urban1, 0)) %>%
mutate(`pct_urban0` = replace_na(pct_urban0, 0))
summary(wp_nga) shapeName geometry total_wp wp_functional
Length:774 MULTIPOLYGON :774 Min. : 0.0 Min. : 0.00
Class :character epsg:4326 : 0 1st Qu.: 45.0 1st Qu.: 17.00
Mode :character +proj=long...: 0 Median : 96.0 Median : 45.50
Mean :122.7 Mean : 67.36
3rd Qu.:168.8 3rd Qu.: 87.75
Max. :894.0 Max. :752.00
pct_functional wp_nonFunctional pct_nonFunctional wp_unknown
Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
1st Qu.: 32.61 1st Qu.: 12.00 1st Qu.: 20.77 1st Qu.: 0.00
Median : 47.41 Median : 33.50 Median : 34.89 Median : 0.00
Mean : 49.84 Mean : 41.37 Mean : 35.58 Mean : 13.76
3rd Qu.: 66.99 3rd Qu.: 60.00 3rd Qu.: 50.00 3rd Qu.: 17.75
Max. :100.00 Max. :278.00 Max. :100.00 Max. :219.00
pct_unknown total_handPump total_mechPump total_tapStand
Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.0000
1st Qu.: 0.00 1st Qu.: 6.00 1st Qu.: 11.00 1st Qu.: 0.0000
Median : 0.00 Median : 47.00 Median : 25.50 Median : 0.0000
Mean : 12.55 Mean : 75.89 Mean : 33.12 Mean : 0.7145
3rd Qu.: 20.83 3rd Qu.:111.00 3rd Qu.: 46.00 3rd Qu.: 0.0000
Max. :100.00 Max. :764.00 Max. :245.00 Max. :42.0000
pct_handPump pct_mechPump pct_tapStand total_uc300
Min. : 0.00 Min. : 0.00 Min. : 0.0000 Min. : 0.00
1st Qu.: 16.70 1st Qu.: 12.20 1st Qu.: 0.0000 1st Qu.: 15.25
Median : 50.99 Median : 31.27 Median : 0.0000 Median : 59.00
Mean : 48.73 Mean : 37.54 Mean : 0.5794 Mean : 88.85
3rd Qu.: 77.78 3rd Qu.: 57.71 3rd Qu.: 0.0000 3rd Qu.:126.75
Max. :100.00 Max. :100.00 Max. :32.8947 Max. :767.00
total_uc1000 total_uc250 total_uc50 total_ucN1000
Min. : 0.00 Min. : 0.0000 Min. :0.000000 Min. : 0.00
1st Qu.: 11.00 1st Qu.: 0.0000 1st Qu.:0.000000 1st Qu.: 16.00
Median : 25.50 Median : 0.0000 Median :0.000000 Median : 60.00
Mean : 33.12 Mean : 0.7403 Mean :0.002584 Mean : 89.59
3rd Qu.: 46.00 3rd Qu.: 0.0000 3rd Qu.:0.000000 3rd Qu.:127.75
Max. :245.00 Max. :42.0000 Max. :1.000000 Max. :767.00
pct_ucN1000 pct_uc300 pct_uc1000 pct_uc250
Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.0000
1st Qu.: 39.68 1st Qu.: 38.67 1st Qu.: 12.20 1st Qu.: 0.0000
Median : 67.03 Median : 65.91 Median : 31.27 Median : 0.0000
Mean : 60.78 Mean : 60.17 Mean : 37.54 Mean : 0.6114
3rd Qu.: 87.35 3rd Qu.: 87.02 3rd Qu.: 57.71 3rd Qu.: 0.0000
Max. :100.00 Max. :100.00 Max. :100.00 Max. :32.8947
total_urban1 total_urban0 pct_urban1 pct_urban0
Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
1st Qu.: 0.00 1st Qu.: 23.00 1st Qu.: 0.00 1st Qu.: 57.27
Median : 9.00 Median : 64.00 Median : 11.95 Median : 86.45
Mean : 25.27 Mean : 97.45 Mean : 25.61 Mean : 72.71
3rd Qu.: 33.00 3rd Qu.:141.00 3rd Qu.: 38.44 3rd Qu.:100.00
Max. :324.00 Max. :894.00 Max. :100.00 Max. :100.00
3.4.5.18 visualise wp_nga distribution :: “is_urban”
urban1 <- tm_shape(bdy_nga)+
tm_polygons(alpha = 0.1) +
tm_shape(wp_nga) +
tm_dots(col = "pct_urban1",
border.col = "gray60",
border.lwd = 0.5) +
tm_view(set.zoom.limits = c(5,9))
urban0 <- tm_shape(bdy_nga)+
tm_polygons(alpha = 0.1) +
tm_shape(wp_nga) +
tm_dots(col = "pct_urban0",
border.col = "gray60",
border.lwd = 0.5) +
tm_view(set.zoom.limits = c(5,9))
tmap_arrange(urban1, urban0,
asp=1,
ncol=2,
sync = TRUE)
3.4.5.19 save and read RDS File :: wp_nga
write_rds(wp_nga,"data/geodata/wp_nga.rds")
wp_nga <- read_rds("data/geodata/wp_nga.rds")3.4.5.20 transform to Projected Coordinate System
Usage of the code chunk below :
st_crs( ) - sf - to inspect the coordinate reference system.
st_crs(wp_nga)Coordinate Reference System:
User input: EPSG:4326
wkt:
GEOGCRS["WGS 84",
ENSEMBLE["World Geodetic System 1984 ensemble",
MEMBER["World Geodetic System 1984 (Transit)"],
MEMBER["World Geodetic System 1984 (G730)"],
MEMBER["World Geodetic System 1984 (G873)"],
MEMBER["World Geodetic System 1984 (G1150)"],
MEMBER["World Geodetic System 1984 (G1674)"],
MEMBER["World Geodetic System 1984 (G1762)"],
MEMBER["World Geodetic System 1984 (G2139)"],
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]],
ENSEMBLEACCURACY[2.0]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
CS[ellipsoidal,2],
AXIS["geodetic latitude (Lat)",north,
ORDER[1],
ANGLEUNIT["degree",0.0174532925199433]],
AXIS["geodetic longitude (Lon)",east,
ORDER[2],
ANGLEUNIT["degree",0.0174532925199433]],
USAGE[
SCOPE["Horizontal component of 3D system."],
AREA["World."],
BBOX[-90,-180,90,180]],
ID["EPSG",4326]]
Remarks :
The EPSG for wp_nga is 4326, which is WGS 84. To compute the proximity distance matrix for clustering analysis, this coordinate reference system needs to transform into EPSG: 26391.
Usage of the code chunk below :
st_set_crs( ) - sf - to update the coordinate reference system.
wp_ngaTrans <- st_set_crs(wp_nga, 26391)Warning: st_crs<- : replacing crs does not reproject data; use st_transform for
that
bdy_ngaTrans <- st_set_crs(bdy_nga, 26391)Warning: st_crs<- : replacing crs does not reproject data; use st_transform for
that
-- review CRS :: wp_ngaTrans
st_crs(wp_ngaTrans)Coordinate Reference System:
User input: EPSG:26391
wkt:
PROJCRS["Minna / Nigeria West Belt",
BASEGEOGCRS["Minna",
DATUM["Minna",
ELLIPSOID["Clarke 1880 (RGS)",6378249.145,293.465,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4263]],
CONVERSION["Nigeria West Belt",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",4,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",4.5,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",0.99975,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",230738.26,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",0,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["(E)",east,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["(N)",north,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Engineering survey, topographic mapping."],
AREA["Nigeria - onshore west of 6°30'E, onshore and offshore shelf."],
BBOX[3.57,2.69,13.9,6.5]],
ID["EPSG",26391]]
-- review CRS :: bdy_ngaTrans
st_crs(bdy_ngaTrans)Coordinate Reference System:
User input: EPSG:26391
wkt:
PROJCRS["Minna / Nigeria West Belt",
BASEGEOGCRS["Minna",
DATUM["Minna",
ELLIPSOID["Clarke 1880 (RGS)",6378249.145,293.465,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4263]],
CONVERSION["Nigeria West Belt",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",4,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",4.5,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",0.99975,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",230738.26,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",0,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["(E)",east,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["(N)",north,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Engineering survey, topographic mapping."],
AREA["Nigeria - onshore west of 6°30'E, onshore and offshore shelf."],
BBOX[3.57,2.69,13.9,6.5]],
ID["EPSG",26391]]
3.5 Exploratory Data Analysis
3.5.1 Identify Outliers
3.5.1.1 plot boxplot “pct_functional”
ggplot(data=wp_ngaTrans,
aes(x=`pct_functional`)) +
geom_boxplot(color="black",
fill="#543005")
3.5.1.2 plot boxplot “pct_nonFunctional”
ggplot(data=wp_ngaTrans,
aes(x=`pct_nonFunctional`)) +
geom_boxplot(color="black",
fill="#C16622FF")
3.5.1.3 plot boxplot “pct_unknown”
ggplot(data=wp_ngaTrans,
aes(x=`pct_unknown`)) +
geom_boxplot(color="black",
fill="#FFA319FF")
Remarks :
Among these 3 key categories of “status_clean”, “unknown” has the most outliers.
3.5.2 Multi-plot Histogram
3.5.2.1 plot histogram for “status_clean”
pctFunctional <- ggplot(data = wp_ngaTrans,
aes(x = `pct_functional`))+
geom_histogram(bins=10,
colour = "black",
fill = "#543005")
pctNonFunctional <- ggplot(data = wp_ngaTrans,
aes(x = `pct_nonFunctional`))+
geom_histogram(bins=10,
colour = "black",
fill = "#C16622FF")
pctUnknown <- ggplot(data = wp_ngaTrans,
aes(x = `pct_unknown`))+
geom_histogram(bins = 10,
colour = "black",
fill = "#FFA319FF")ggarrange(pctFunctional,pctNonFunctional,pctUnknown,
ncol = 2,
nrow = 2)adding dummy grobs

4. CORRELATION ANALYSIS
4.1 Create Data Table for Correlation Matrix Analysis
cluster_vars <- wp_ngaTrans %>%
st_set_geometry(NULL) %>%
select("shapeName",
"pct_functional",
"pct_nonFunctional",
"pct_unknown",
"pct_handPump",
"pct_mechPump",
"pct_tapStand",
"pct_uc300",
"pct_uc1000",
"pct_ucN1000",
"pct_uc250",
"pct_urban0")
head(cluster_vars,5) shapeName pct_functional pct_nonFunctional pct_unknown pct_handPump
1 Aba North 41.17647 52.94118 5.882353 11.764706
2 Aba South 40.84507 46.47887 9.859155 9.859155
3 Abadam 0.00000 0.00000 0.000000 0.000000
4 Abaji 40.35088 59.64912 0.000000 40.350877
5 Abak 47.91667 50.00000 0.000000 8.333333
pct_mechPump pct_tapStand pct_uc300 pct_uc1000 pct_ucN1000 pct_uc250
1 82.35294 0 17.647059 82.35294 17.647059 0
2 87.32394 0 12.676056 87.32394 12.676056 0
3 0.00000 0 0.000000 0.00000 0.000000 0
4 59.64912 0 40.350877 59.64912 40.350877 0
5 91.66667 0 8.333333 91.66667 8.333333 0
pct_urban0
1 0.000000
2 5.633803
3 0.000000
4 84.210526
5 83.333333
4.2 Visualise Correlation Matrix
Usage of the code chunk below :
corrplot.mixed( ) - corrplot - to use mixed methods to visualise a correlation matrix.
This plot allows to identify the pattern and the relationship in the matrix.
corrplot.mixed((cor(cluster_vars[,2:12])),
upper = "number",
lower = "ellipse",
tl.col = "black",
diag = "l",
tl.pos = "lt")
Remarks :
Following are the pairs with strong correlation :
| correlation coefficients | variable_1 | variable_2 |
|---|---|---|
| 1.00 | pct_mechPump | pct_uc1000 |
| 0.99 | pct_tapStand | pct_uc250 |
| 0.99 | pct_uc300 | pct_ucN1000 |
| -0.91 | pct_mechPump | pct_ucN1000 |
| -0.91 | pct_uc1000 | pct_ucN1000 |
| -0.90 | pct_mechPump | pct_uc300 |
| -0.90 | pct_uc300 | pct_uc1000 |
4.2.1 Replace Row ID with “shapeName”
row.names(cluster_vars) <- cluster_vars$shapeName4.2.2 Trim High Correlation Variable and “shapeName”
cluster_varsTrim <- cluster_vars %>%
select(-shapeName, -pct_ucN1000, -pct_mechPump)4.2.2.1 review trimmed data table
summary(cluster_varsTrim) pct_functional pct_nonFunctional pct_unknown pct_handPump
Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
1st Qu.: 32.61 1st Qu.: 20.77 1st Qu.: 0.00 1st Qu.: 16.70
Median : 47.41 Median : 34.89 Median : 0.00 Median : 50.99
Mean : 49.84 Mean : 35.58 Mean : 12.55 Mean : 48.73
3rd Qu.: 66.99 3rd Qu.: 50.00 3rd Qu.: 20.83 3rd Qu.: 77.78
Max. :100.00 Max. :100.00 Max. :100.00 Max. :100.00
pct_tapStand pct_uc300 pct_uc1000 pct_uc250
Min. : 0.0000 Min. : 0.00 Min. : 0.00 Min. : 0.0000
1st Qu.: 0.0000 1st Qu.: 38.67 1st Qu.: 12.20 1st Qu.: 0.0000
Median : 0.0000 Median : 65.91 Median : 31.27 Median : 0.0000
Mean : 0.5794 Mean : 60.17 Mean : 37.54 Mean : 0.6114
3rd Qu.: 0.0000 3rd Qu.: 87.02 3rd Qu.: 57.71 3rd Qu.: 0.0000
Max. :32.8947 Max. :100.00 Max. :100.00 Max. :32.8947
pct_urban0
Min. : 0.00
1st Qu.: 57.27
Median : 86.45
Mean : 72.71
3rd Qu.:100.00
Max. :100.00
5. CLUSTERING ANALYSIS
5.1 Hierarchy Clustering
There are four (4) main steps :
- compute proximity matrix.
- assign data point to a cluster.
- merge clusters based on similarity between clusters.
- update the distance matrix.
5.1.1 Standardise Data
As shown in the 4.2.3.1, there are few variables with Max. different from others. Hence, standardisation will be required prior to further analysis.
5.1.1.1 standardise with min-max method
nga_wpStd <- normalize(cluster_varsTrim)
summary(nga_wpStd) pct_functional pct_nonFunctional pct_unknown pct_handPump
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:0.3261 1st Qu.:0.2077 1st Qu.:0.0000 1st Qu.:0.1670
Median :0.4741 Median :0.3489 Median :0.0000 Median :0.5099
Mean :0.4984 Mean :0.3558 Mean :0.1255 Mean :0.4873
3rd Qu.:0.6699 3rd Qu.:0.5000 3rd Qu.:0.2083 3rd Qu.:0.7778
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
pct_tapStand pct_uc300 pct_uc1000 pct_uc250
Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.00000
1st Qu.:0.00000 1st Qu.:0.3867 1st Qu.:0.1220 1st Qu.:0.00000
Median :0.00000 Median :0.6591 Median :0.3127 Median :0.00000
Mean :0.01761 Mean :0.6017 Mean :0.3754 Mean :0.01859
3rd Qu.:0.00000 3rd Qu.:0.8702 3rd Qu.:0.5771 3rd Qu.:0.00000
Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.00000
pct_urban0
Min. :0.0000
1st Qu.:0.5727
Median :0.8645
Mean :0.7271
3rd Qu.:1.0000
Max. :1.0000
5.1.1.2 standardise with Z-score method
nga_wpZ <- scale(cluster_varsTrim)
describe(nga_wpZ) vars n mean sd median trimmed mad min max range skew
pct_functional 1 774 0 1 -0.10 -0.02 1.04 -2.06 2.07 4.13 0.14
pct_nonFunctional 2 774 0 1 -0.03 -0.02 1.05 -1.71 3.10 4.81 0.23
pct_unknown 3 774 0 1 -0.62 -0.22 0.00 -0.62 4.30 4.92 2.01
pct_handPump 4 774 0 1 0.07 0.00 1.37 -1.49 1.57 3.06 -0.09
pct_tapStand 5 774 0 1 -0.19 -0.19 0.00 -0.19 10.46 10.65 7.22
pct_uc300 6 774 0 1 0.19 0.08 1.10 -2.02 1.34 3.35 -0.56
pct_uc1000 7 774 0 1 -0.21 -0.09 1.05 -1.28 2.14 3.42 0.61
pct_uc250 8 774 0 1 -0.20 -0.20 0.00 -0.20 10.37 10.57 7.10
pct_urban0 9 774 0 1 0.42 0.17 0.62 -2.23 0.84 3.06 -1.12
kurtosis se
pct_functional -0.62 0.04
pct_nonFunctional -0.42 0.04
pct_unknown 4.15 0.04
pct_handPump -1.33 0.04
pct_tapStand 58.65 0.04
pct_uc300 -0.87 0.04
pct_uc1000 -0.78 0.04
pct_uc250 57.14 0.04
pct_urban0 -0.09 0.04
5.1.1.3 visualise distribution of standardised clustering variable
-- functional water point
fwp <- ggplot(data=cluster_varsTrim,
aes(x= `pct_functional`)) +
geom_histogram(bins=20,
color="black",
fill="steelblue") +
ggtitle("Before Standardisation")
fwp_stdDf <- as.data.frame(nga_wpStd)
fwp_std <- ggplot(data=fwp_stdDf,
aes(x=`pct_functional`)) +
geom_histogram(bins=20,
color="black",
fill="steelblue") +
ggtitle("Min-Max Stdsn.")
fwp_zDf <- as.data.frame(nga_wpZ)
fwp_z <- ggplot(data=fwp_zDf,
aes(x=`pct_functional`)) +
geom_histogram(bins=20,
color="black",
fill="steelblue") +
ggtitle("Z-score Stdsn.")
ggarrange(fwp, fwp_std, fwp_z,
ncol = 3,
nrow = 1)
fwp <- ggplot(data=cluster_varsTrim,
aes(x= `pct_functional`)) +
geom_density(color="black",
fill="steelblue") +
ggtitle("Before Standardisation")
fwp_stdDf <- as.data.frame(nga_wpStd)
fwp_std <- ggplot(data=fwp_stdDf,
aes(x=`pct_functional`)) +
geom_density(color="black",
fill="steelblue") +
ggtitle("Min-Max Stdsn.")
fwp_zDf <- as.data.frame(nga_wpZ)
fwp_z <- ggplot(data=fwp_zDf,
aes(x=`pct_functional`)) +
geom_density(color="black",
fill="steelblue") +
ggtitle("Z-score Stdsn.")
ggarrange(fwp, fwp_std, fwp_z,
ncol = 3,
nrow = 1)
-- water point deployed with handpump
HP <- ggplot(data=cluster_varsTrim,
aes(x= `pct_handPump`)) +
geom_histogram(bins=20,
color="black",
fill="steelblue") +
ggtitle("Before Standardisation")
fwp_stdDf <- as.data.frame(nga_wpStd)
HP_std <- ggplot(data=fwp_stdDf,
aes(x=`pct_handPump`)) +
geom_histogram(bins=20,
color="black",
fill="steelblue") +
ggtitle("Min-Max Stdsn.")
fwp_zDf <- as.data.frame(nga_wpZ)
HP_z <- ggplot(data=fwp_zDf,
aes(x=`pct_handPump`)) +
geom_histogram(bins=20,
color="black",
fill="steelblue") +
ggtitle("Z-score Stdsn.")
ggarrange(HP, HP_std, HP_z,
ncol = 3,
nrow = 1)
HP <- ggplot(data=cluster_varsTrim,
aes(x= `pct_handPump`)) +
geom_density(color="black",
fill="steelblue") +
ggtitle("Before Standardisation")
fwp_stdDf <- as.data.frame(nga_wpStd)
HP_std <- ggplot(data=fwp_stdDf,
aes(x=`pct_handPump`)) +
geom_density(color="black",
fill="steelblue") +
ggtitle("Min-Max Stdsn.")
fwp_zDf <- as.data.frame(nga_wpZ)
HP_z <- ggplot(data=fwp_zDf,
aes(x=`pct_handPump`)) +
geom_density(color="black",
fill="steelblue") +
ggtitle("Z-score Stdsn.")
ggarrange(HP, HP_std, HP_z,
ncol = 3,
nrow = 1)
-- water point with 1000 users usage capacity
uc1000 <- ggplot(data=cluster_varsTrim,
aes(x= `pct_uc1000`)) +
geom_histogram(bins=20,
color="black",
fill="steelblue") +
ggtitle("Before Standardisation")
fwp_stdDf <- as.data.frame(nga_wpStd)
uc1000_std <- ggplot(data=fwp_stdDf,
aes(x=`pct_uc1000`)) +
geom_histogram(bins=20,
color="black",
fill="steelblue") +
ggtitle("Min-Max Stdsn.")
fwp_zDf <- as.data.frame(nga_wpZ)
uc1000_z <- ggplot(data=fwp_zDf,
aes(x=`pct_uc1000`)) +
geom_histogram(bins=20,
color="black",
fill="steelblue") +
ggtitle("Z-score Stdsn.")
ggarrange(uc1000, uc1000_std, uc1000_z,
ncol = 3,
nrow = 1)
uc1000 <- ggplot(data=cluster_varsTrim,
aes(x= `pct_uc1000`)) +
geom_density(color="black",
fill="steelblue") +
ggtitle("Before Standardisation")
fwp_stdDf <- as.data.frame(nga_wpStd)
uc1000_std <- ggplot(data=fwp_stdDf,
aes(x=`pct_uc1000`)) +
geom_density(color="black",
fill="steelblue") +
ggtitle("Min-Max Stdsn.")
fwp_zDf <- as.data.frame(nga_wpZ)
uc1000_z <- ggplot(data=fwp_zDf,
aes(x=`pct_uc1000`)) +
geom_density(color="black",
fill="steelblue") +
ggtitle("Z-score Stdsn.")
ggarrange(uc1000, uc1000_std, uc1000_z,
ncol = 3,
nrow = 1)
5.1.2 Compute Proximity Matrix
Usage of the code chunk below :
dist( ) - stats - to compute the proximity distance matrix. Among euclidean, maximum, manhattan, canberra, binary and minkowski, euclidean is used to compute proxmat_euc.
proxmat_euc <- dist(cluster_varsTrim, method = 'euclidean')5.1.3 Compute Hierarchical Clustering
Usage of the code chunk below :
hclust( ) - stats - to compute cluster with agglomeration method.
ggdendrogram( ) - ggdendro - to plot dendrogram with tools available in ggplot2.
hieClust_warD <- hclust(proxmat_euc, method = 'ward.D')
ggdendrogram(hieClust_warD,
rotate = TRUE,
size = 2,
theme_dendro = FALSE)
5.1.4 Determine Optimal Clustering Algorithm
Usage of the code chunk below :
agnes( ) - cluster - to get agglomerative coefficient of 4 clustering structure, namely “average”, “single”, “complete” and “ward”.
m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")
ac <- function(x) {
agnes(cluster_varsTrim, method = x)$ac
}
map_dbl(m, ac) average single complete ward
0.9264460 0.8825086 0.9494033 0.9923235
Remarks :
Value 1 indicate strongest clustering structure.
Ward’s method provides the strongest clustering structure. Therefore, Ward’s method to be used in subsequent analysis.
5.1.5 Determine Optimal Clusters
To determine the optimal clusters to retain, following commons methods are tested :
Gap statistic
Elbow
Average Silhouette
5.1.5.1 compute Gap Statistic method
Usage of the code chunk below :
clusGap( ) - cluster - to compute the gap statistic.
set.seed(12345)
gap_stat <- clusGap(cluster_varsTrim,
FUN = hcut,
nstart = 25,
K.max = 30,
B = 50)
# Print the result
print(gap_stat, method = "firstmax")Clustering Gap statistic ["clusGap"] from call:
clusGap(x = cluster_varsTrim, FUNcluster = hcut, K.max = 30, B = 50, nstart = 25)
B=50 simulated reference sets, k = 1..30; spaceH0="scaledPCA"
--> Number of clusters (method 'firstmax'): 30
logW E.logW gap SE.sim
[1,] 9.815280 10.315513 0.5002330 0.008043258
[2,] 9.602465 10.203144 0.6006791 0.008510149
[3,] 9.510126 10.144007 0.6338806 0.010041659
[4,] 9.443727 10.094054 0.6503272 0.009408325
[5,] 9.337773 10.055036 0.7172630 0.008377048
[6,] 9.286151 10.021049 0.7348986 0.008061617
[7,] 9.226030 9.992116 0.7660864 0.007509762
[8,] 9.181055 9.967079 0.7860239 0.007406854
[9,] 9.132662 9.944999 0.8123364 0.007505962
[10,] 9.088576 9.924941 0.8363644 0.008003248
[11,] 9.057435 9.906340 0.8489044 0.008086779
[12,] 9.019733 9.888898 0.8691655 0.008378669
[13,] 8.988798 9.872948 0.8841499 0.008499708
[14,] 8.962331 9.857905 0.8955736 0.008609360
[15,] 8.932159 9.843557 0.9113980 0.008574353
[16,] 8.908477 9.830133 0.9216564 0.008442998
[17,] 8.880801 9.817233 0.9364313 0.008235439
[18,] 8.846775 9.805149 0.9583744 0.008068772
[19,] 8.828254 9.793671 0.9654167 0.007975742
[20,] 8.812263 9.782502 0.9702393 0.007904523
[21,] 8.793736 9.771938 0.9782012 0.007903082
[22,] 8.777957 9.761659 0.9837022 0.007927244
[23,] 8.762944 9.751686 0.9887413 0.007838986
[24,] 8.745719 9.741903 0.9961837 0.007850884
[25,] 8.732706 9.732508 0.9998020 0.007792150
[26,] 8.716858 9.723358 1.0064996 0.007813310
[27,] 8.703095 9.714414 1.0113191 0.007684052
[28,] 8.684688 9.705770 1.0210819 0.007586041
[29,] 8.664250 9.697408 1.0331579 0.007592467
[30,] 8.649964 9.689233 1.0392698 0.007607051
-- visualise gap_stat
Usage of the code chunk below :
fviz_nbclust( ) - factoextra - to compute and visualise the Optimal Number of clusters.
set.seed(12345)
fviz_nbclust(nga_wpZ,
kmeans,
nstart = 25,
method = "gap_stat",
nboot = 50)+
labs(subtitle = "Gap statistic method")Warning: did not converge in 10 iterations
Warning: did not converge in 10 iterations
Warning: did not converge in 10 iterations

5.1.5.2 compute and visualise Elbow method
fviz_nbclust(nga_wpZ, kmeans, method = "wss") +
geom_vline(xintercept = 4, linetype = 2)+
labs(subtitle = "Elbow method")
5.1.5.3 compute and visualise Silhouette method
fviz_nbclust(nga_wpZ, kmeans, method = "silhouette")+
labs(subtitle = "Silhouette method")
Remarks :
Given the Elbow method, Silhouette method and Gap Statistic method, the 5-cluster by Silhouette method will be used for the rest of the study.
5.1.5.4 interpret with Dendrogram
Usage of the code chunk below :
rect.hclust( ) - stats - to draw the dendrogram with a border around the selected clusters.
plot(hieClust_warD, cex = 0.6)
rect.hclust(hieClust_warD,
k = 5,
border = 2:5)
5.1.6 Visually-Driven Hierarchical Clustering Analysis
The data is loaded into a data frame, but it has to be a data matrix to plot the heatmap. Hence, the data frame will need to first transform into a matrix.
5.1.6.1 transform data frame into matrix
Usage of the code chunk below :
data.matrix( ) - base - to transform cluster_varsTrim data frame into a data matrix, and named it as nga_clustMat.
nga_clustMat <- data.matrix(cluster_varsTrim)5.1.6.2 plot interactive cluster heatmap
Usage of the code chunk below :
heatmaply( ) - heatmaply - to build an interactive cluster heatmap.
heatmaply(normalize(nga_clustMat),
Colv=NA,
dist_method = "euclidean",
hclust_method = "ward.D",
seriate = "OLO",
colors = Blues,
k_row = 5,
margins = c(NA,200,60,NA),
fontsize_row = 4,
fontsize_col = 5,
main="Geographic Segmentation of Nigeria by Water Points",
xlab = "Water Points",
ylab = "Nigeria LGA"
)Remarks :
Based on the plot above, 5 clusters to be retained for further analysis.
5.1.6.3 map the formed cluster
Usage of the code chunk below :
cutree( ) - base - to derive a 5-cluster model, and named the output as groups.
groups <- as.factor(cutree(hieClust_warD, k=5))5.1.6.4 append groups to wp_ngaTrans
nga_clust.sf <- cbind(wp_ngaTrans, as.matrix(groups)) %>%
rename(`cluster`=`as.matrix.groups.`)5.1.6.5 plot choropleth map :: nga_clust.sf
qtm(nga_clust.sf, "cluster")
Remarks :
The choropleth map above shows the fragmented clusters by the used of non-spatial clustering algorithm (hierarchical cluster analysis method).
5.2 Spatially Constrained Clustering :: SKATER Approach
SKATER function only support sp objects in SpatialPolygonDataFrame. Hence, the wp_ngaTrans has to first transform into SpatialPolygonDataFrame before proceed further.
5.2.1 Convert SF to SP Data Frame
Usage of the code chunk below :
as_Spatial( ) - sf - to convert wp_ngaTrans into nga_sp in a SP data frame.
nga.sp <- as_Spatial(wp_ngaTrans)5.2.2 Compute Neighbour List
Usage of the code chunk below :
poly2nb( ) - spdep - to compute the neighbours list from polygon list.
nga.nb <- poly2nb(nga.sp, queen = TRUE)
summary(nga.nb)Neighbour list object:
Number of regions: 774
Number of nonzero links: 4440
Percentage nonzero weights: 0.7411414
Average number of links: 5.736434
1 region with no links:
86
Link number distribution:
0 1 2 3 4 5 6 7 8 9 10 11 12 14
1 2 14 57 125 182 140 122 72 41 12 4 1 1
2 least connected regions:
138 560 with 1 link
1 most connected region:
508 with 14 links
Remarks :
There is one (1) region, i.e. #86 is without link. It has to be removed first before proceed to plot the neighbours list.
5.2.2.1 remove 0-neighbour region
wp_ngaTrans1 <- wp_ngaTrans[-86,]
cluster_varsTrim1 <- cluster_varsTrim[-86,]
nga_clust.sf1 <- nga_clust.sf[-86,]
nga_wpZ1 <- nga_wpZ[-86,]
nga.sp1 <- as_Spatial(wp_ngaTrans1)
nga.nb1 <- poly2nb(nga.sp1)
summary(nga.nb1)Neighbour list object:
Number of regions: 773
Number of nonzero links: 4440
Percentage nonzero weights: 0.7430602
Average number of links: 5.743855
Link number distribution:
1 2 3 4 5 6 7 8 9 10 11 12 14
2 14 57 125 182 140 122 72 41 12 4 1 1
2 least connected regions:
138 560 with 1 link
1 most connected region:
508 with 14 links
5.2.2.2 plot Neighbour List by Centroid Node
Usage of the code chunk below : plot the boundary first before the neighbour list object to avoid any region from being clipped away.
plot(nga.sp1,
border=grey(.5))
plot(nga.nb1,
coordinates(nga.sp1),
col="blue",
add=TRUE)
5.2.3 Compute Minimum Spanning Tree (MST)
5.2.3.1 calculate edge costs
Usage of the code chunk below :
nbcosts( ) - spdep - to compute the cost of each edge which is the distance between nodes.
edge_cost <- nbcosts(nga.nb1, cluster_varsTrim1)5.2.3.2 specify spatial weight
nb2listw( ) - spdep - to specify edge_cost as the spatial weights. Set the “style” to “B” to ensure the cost values are not row-standardised.
nga.w <- nb2listw(nga.nb1,
edge_cost,
style = "B")Warning in nb2listw(nga.nb1, edge_cost, style = "B"): zero sum general weights
summary(nga.w)Characteristics of weights list object:
Neighbour list object:
Number of regions: 773
Number of nonzero links: 4440
Percentage nonzero weights: 0.7430602
Average number of links: 5.743855
Link number distribution:
1 2 3 4 5 6 7 8 9 10 11 12 14
2 14 57 125 182 140 122 72 41 12 4 1 1
2 least connected regions:
138 560 with 1 link
1 most connected region:
508 with 14 links
Weights style: B
Weights constants summary:
n nn S0 S1 S2
B 773 597529 245120.7 38463020 406298492
5.2.3.3 compute minimum spanning tree
Usage of the code chunk below :
nbcosts( ) - spdep - to compute the minimum spanning tree.
nga_minSpanT <- mstree(nga.w)-- review class and dimension of the computed MST
class(nga_minSpanT)[1] "mst" "matrix"
dim(nga_minSpanT)[1] 772 3
head(nga_minSpanT) [,1] [,2] [,3]
[1,] 474 387 134.66287
[2,] 387 478 64.55618
[3,] 387 439 78.95255
[4,] 439 476 50.91751
[5,] 439 270 105.68114
[6,] 270 90 66.67241
5.2.3.4 plot MST Neighbour List
plot(nga.sp1, border=gray(.5))
plot.mst(nga_minSpanT,
coordinates(nga.sp1),
col="blue",
cex.lab=0.7,
cex.circles=0.005,
add=TRUE)
5.2.4 Compute Spatially Constrained Cluster
Usage of the code chunk below :
skater( ) - spdep - to compute the spatially constrained cluster.
clust5 <- spdep::skater(edges = nga_minSpanT[,1:2],
data = cluster_varsTrim1,
method = "euclidean",
ncuts = 4)
str(clust5)List of 8
$ groups : num [1:773] 3 3 1 5 4 1 2 2 1 3 ...
$ edges.groups:List of 5
..$ :List of 3
.. ..$ node: num [1:309] 773 747 492 131 382 224 413 488 439 257 ...
.. ..$ edge: num [1:308, 1:3] 131 382 224 413 257 767 439 704 476 75 ...
.. ..$ ssw : num 17013
..$ :List of 3
.. ..$ node: num [1:129] 597 315 316 557 195 571 339 744 205 213 ...
.. ..$ edge: num [1:128, 1:3] 315 316 557 195 571 15 82 579 744 351 ...
.. ..$ ssw : num 7874
..$ :List of 3
.. ..$ node: num [1:85] 364 10 729 215 337 551 102 103 66 19 ...
.. ..$ edge: num [1:84, 1:3] 23 536 578 103 19 375 727 617 188 103 ...
.. ..$ ssw : num 4545
..$ :List of 3
.. ..$ node: num [1:39] 550 202 330 287 374 732 537 586 733 201 ...
.. ..$ edge: num [1:38, 1:3] 612 586 136 245 332 429 504 537 586 616 ...
.. ..$ ssw : num 1294
..$ :List of 3
.. ..$ node: num [1:211] 67 510 401 122 24 526 475 489 663 303 ...
.. ..$ edge: num [1:210, 1:3] 67 549 510 119 639 401 556 122 693 70 ...
.. ..$ ssw : num 10380
$ not.prune : NULL
$ candidates : int [1:5] 1 2 3 4 5
$ ssto : num 52660
$ ssw : num [1:5] 52660 48724 44268 42679 41106
$ crit : num [1:2] 1 Inf
$ vec.crit : num [1:773] 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "class")= chr "skater"
5.2.4.1 tabulate cluster assignment
ccs5 <- clust5$groups
table(ccs5)ccs5
1 2 3 4 5
309 129 85 39 211
5.2.4.2 plot the pruned tree
plot(nga.sp1, border=gray(.5))
plot(clust5,
coordinates(nga.sp1),
cex.lab=.7,
groups.colors=c("red","green","blue", "brown", "pink"),
cex.circles=0.005,
add=TRUE)Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
coords[id2, : "add" is not a graphical parameter
Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
coords[id2, : "add" is not a graphical parameter
Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
coords[id2, : "add" is not a graphical parameter
Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
coords[id2, : "add" is not a graphical parameter
Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
coords[id2, : "add" is not a graphical parameter

5.2.5 Visualise SKATER Clusters in Choropleth Map
groups_mat <- as.matrix(clust5$groups)
nga_spClust.sf <- cbind(nga_clust.sf1, as.factor(groups_mat)) %>%
rename(`sp_cluster`=`as.factor.groups_mat.`)To compare the output of hierarchical clustering and spatially constrained hierarchical clustering :
hieClust_map <- qtm(nga_clust.sf1,
"cluster") +
tm_borders(alpha = 0.5)
ngaClust_map <- qtm(nga_spClust.sf,
"sp_cluster") +
tm_borders(alpha = 0.5)
tmap_arrange(hieClust_map, ngaClust_map,
asp=NA, ncol=2)Warning: One tm layer group has duplicated layer types, which are omitted. To
draw multiple layers of the same type, use multiple layer groups (i.e. specify
tm_shape prior to each of them).
Warning: One tm layer group has duplicated layer types, which are omitted. To
draw multiple layers of the same type, use multiple layer groups (i.e. specify
tm_shape prior to each of them).

5.3 Spatially Constrained Clustering :: ClustGeo Method
5.3.1 Perform Ward-like Hierarchical Clustering
Usage of the code chunk below :
hclustgeo( ) - ClustGeo - to perform a typical Ward-like hierarchical clustering.
proxmat_ngc <- dist(cluster_varsTrim1, method = 'euclidean')nonGeo_clust <- hclustgeo(proxmat_ngc)
plot(nonGeo_clust, cex = 0.5)
rect.hclust(nonGeo_clust,
k = 5,
border = 2:5)
5.3.1.1 visualise the formed clusters
groups_ngc <- as.factor(cutree(nonGeo_clust, k=5))nga_ngeo_clust.sf <- cbind(wp_ngaTrans1, as.matrix(groups_ngc)) %>%
rename(`cluster` = `as.matrix.groups_ngc.`)qtm(nga_ngeo_clust.sf, "cluster")
5.3.2 Perform Spatially Constrained Hierarchical Clustering
Usage of the code chunk below :
st_distance( ) - sf - to derive the spatial distance matrix before perform spatially constrained hierarchical clustering.
as.dist( ) - stats - to convert the data frame into matrix.
dist <- st_distance(wp_ngaTrans1, wp_ngaTrans1)
dist_mat <- as.dist(dist)5.3.2.1 determine alpha value
choicealpha( ) - psych - to determine a suitable value for the mixing parameter alpha.
cr <- choicealpha(
proxmat_ngc,
dist_mat,
range.alpha = seq(0, 1, 0.1),
K=5,
graph = TRUE)

Remarks :
With reference to the plot above, alpha = 0.4 to be used to perform spatially constrained hierarchical clustering.
5.3.2.2 compute spatially constrained hierarchical clustering
clustG <- hclustgeo(proxmat_ngc,
dist_mat,
alpha = 0.4)5.3.2.3 derive cluster object
groups_cg <- as.factor(cutree(clustG, k=5))5.3.2.4 combine group_cg with wp_ngaTrans1
wp_nga1_GClust <- cbind(wp_ngaTrans1, as.matrix(groups_cg)) %>%
rename(`cluster` = `as.matrix.groups_cg.`)5.3.2.5 plot delineated spatially constrained cluster
qtm(wp_nga1_GClust, "cluster")
5.4 Visual Interpretation of Clusters
5.4.1 Visualise Individual Clustering Variable
5.4.1.1 plot boxplot
ggplot(data = nga_ngeo_clust.sf,
aes(x = cluster, y = pct_functional)) +
geom_boxplot()
Remarks :
The boxplot reveals Cluster 5 displays the highest mean of functional water points. This is followed by Cluster 1, 3, 2, and 4.
5.4.2 Visualise Multivariate
Usage of the code chunk below :
ggparcoord( ) - GGally - to reveal clustering variables by cluster.
nga_ngeo_clust.sf1 <- nga_ngeo_clust.sf %>%
select("shapeName",
"pct_functional",
"pct_nonFunctional",
"pct_unknown",
"pct_handPump",
"pct_tapStand",
"pct_uc300",
"pct_uc1000",
"pct_uc250",
"pct_urban0",
"cluster")
head(nga_ngeo_clust.sf1,3)Simple feature collection with 3 features and 11 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 7.307433 ymin: 5.052192 xmax: 13.83477 ymax: 13.71406
Projected CRS: Minna / Nigeria West Belt
shapeName pct_functional pct_nonFunctional pct_unknown pct_handPump
1 Aba North 41.17647 52.94118 5.882353 11.764706
2 Aba South 40.84507 46.47887 9.859155 9.859155
3 Abadam 0.00000 0.00000 0.000000 0.000000
pct_tapStand pct_uc300 pct_uc1000 pct_uc250 pct_urban0 cluster
1 0 17.64706 82.35294 0 0.000000 1
2 0 12.67606 87.32394 0 5.633803 1
3 0 0.00000 0.00000 0 0.000000 1
geometry
1 MULTIPOLYGON (((7.401109 5....
2 MULTIPOLYGON (((7.334479 5....
3 MULTIPOLYGON (((13.83477 13...
ggparcoord(data = nga_ngeo_clust.sf,
columns = c(2:19),
scale = "globalminmax",
alphaLines = 0.2,
boxplot = TRUE,
title = "Multiple Parallel Coordinates Plots of Variables by Cluster") +
facet_grid(~ cluster, scales = "fixed") +
theme(axis.text.x = element_text(angle = 30))
The parallel coordinate plot above reveals that households in Cluster 4 townships tend to own the highest number of TV and mobile-phone. On the other hand, households in Cluster 5 tends to own the lowest of all the five ICT.
Note that the scale argument of ggparcoor() provide several methods to scale the clustering variables. They are:
std: univariately, subtract mean and divide by standard deviation.
robust: univariately, subtract median and divide by median absolute deviation.
uniminmax: univariately, scale so the minimum of the variable is zero, and the maximum is one.
globalminmax: no scaling is done; the range of the graphs is defined by the global minimum and the global maximum.
center: use uniminmax to standardize vertical height, then center each variable at a value specified by the scaleSummary param.
centerObs: use uniminmax to standardize vertical height, then center each variable at the value of the observation specified by the centerObsID param
5.4.3 Compute Summary Statistics
nga_ngeo_clust.sf %>%
st_set_geometry(NULL) %>%
group_by(cluster) %>%
summarise(mean_pct_functional = mean(pct_functional),
mean_pct_nonFunctional = mean(pct_nonFunctional),
mean_pct_unknown = mean(pct_unknown),
mean_pct_handPump = mean(pct_handPump),
mean_pct_tapStand = mean(pct_tapStand),
mean_pct_uc300 = mean(pct_uc300),
mean_pct_uc1000 = mean(pct_uc1000),
mean_pct_uc250 = mean(pct_uc250),
mean_pct_urban0 = mean(pct_urban0))# A tibble: 5 × 10
cluster mean_pct_fun…¹ mean_…² mean_…³ mean_…⁴ mean_…⁵ mean_…⁶ mean_…⁷ mean_…⁸
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 51.7 29.0 9.25 35.0 0.525 44.1 45.5 0.539
2 2 46.2 42.2 11.3 59.9 1.00 70.6 28.3 1.03
3 3 40.5 49.4 9.15 9.87 0.304 19.2 80.4 0.345
4 4 18.4 14.6 67.0 18.1 0.337 72.1 27.5 0.410
5 5 78.9 20.7 0.295 88.7 0.0677 89.2 10.7 0.0903
# … with 1 more variable: mean_pct_urban0 <dbl>, and abbreviated variable names
# ¹mean_pct_functional, ²mean_pct_nonFunctional, ³mean_pct_unknown,
# ⁴mean_pct_handPump, ⁵mean_pct_tapStand, ⁶mean_pct_uc300, ⁷mean_pct_uc1000,
# ⁸mean_pct_uc250